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Abstract 



We study the hypothesis testing problem of detecting a shift between the means of two mul- 
tivariate normal distributions in the high-dimensional setting, allowing for the data dimension 
p to exceed the sample size n. Specifically, we propose a new test statistic for the two-sample 
test of means that integrates a random projection with the classical Hotelling T 2 statistic. 
Working under a high-dimensional framework with (p, n) — ¥ oo, we first derive an asymptotic 
power function for our test, and then provide sufficient conditions for it to achieve greater power 
than other state-of-the-art tests. Lastly, using ROC curves generated from simulated data, we 
demonstrate superior performance with competing tests in the parameter regimes anticipated 
by our theoretical results. 

1 Introduction 

The goal of two-sample hypothesis testing is to determine whether or not two data sets {X\ , . . . , X ni } 
and {Y\, . . . ,Y n2 } have been sampled from the same distribution. In the high-dimensional version 
of this problem, each sample is a vector in MP, and the dimension p may be substantially larger 
than both of the sample sizes n\ and 712 • This situation makes the two-sample problem challenging, 
since the cumulative effect of variance in many variables can "explain away" the correct hypothesis. 
Moreover, the occurrence of high-dimensional two-sample problems is becoming increasingly com- 
mon in application domains such as molecular biology and fMRI [e.g., aaaa. In transcriptomics, 
for instance, p gene expression measures on the order of hundreds or thousands may be used to 
investigate differences between two biological conditions, and it is often difficult to obtain sample 
sizes ri\ and ri2 larger than several dozen in each condition. For problems such as these, classical 
methods may be ineffective, or not applicable at all. Accordingly, there has been growing interest 
in developing testing procedures that are better suited to deal with the effects of dimension [e.g., 



A fundamental instance of the general two-sample problem is the two-sample test of means with 
Gaussian data. In this case, two independent sets of samples {X\, . . . , X ni } and {Y%, . . . , Y n2 } C MP 
are generated in an i.i.d. manner from p-dimensional multivariate normal distributions _/V(^i,X) 
and iV(/X2,S) respectively, where the mean vectors \x\ and [12, and positive-definite covariance 
matrix E y 0, are all fixed and unknown. The hypothesis testing problem of interest is 



The most well-known test statistic for this problem is the Hotelling T 2 statistic, defined by 
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where X := — V^f—i %i an d Y := -f- Yi are the sample means, £ is the pooled sample 
covariance matrix, given by E : = ± £™ii(X, - - X) T + £ E^UK - - Y V \ and we 

define n := ni + n-i — 2 for convenience. 

When p > n, the matrix £ is singular, and the Hotelling test is not well-defined. Even when 
p < n, the Hotelling test is known to perform poorly if p is nearly as large as n. This behavior 
was demonstrated in a seminal paper of Bai and Saranadasa [5( (or BS for short), who studied 
the performance of the Hotelling test under (p,n) — > oo with p/n — > 1 — e, and showed that the 
asymptotic power of the test suffers for small values of e > 0. In subsequent years, a number 
of improvements on the Hotelling test in the high-dimensional setting have been proposed [e.g., 

Due to the well-known degradation of £ as an estimate of S in high dimensions, the line of 
research on extensions of the Hotelling test has focused on replacing £ in the definition of T 2 
with other estimators of S. For instance, BS [5] proposed a test statistic based on the quantity 
(X — Y) T (X — Y), which can be viewed as replacing E with I pX p- It was shown by BS that their 
statistic achieves non-trivial asymptotic power whenever the ratio p/n converges to a constant 
c G (0, oo). This statistic was later refined by Chen and Qin jg] (CQ for short) who showed that 
the same asymptotic power can be achieved without imposing any explicit restriction on the limit 
of p/n. Another direction was considered by Srivastava and Du [a, W\ (SD for short), who proposed 
a test statistic based on (X — Y) J D~ 1 {X — Y), where D is the diagonal matrix associated with 
£, i.e. Da = £jj. This choice ensures that D is invertible for all dimensions p with probability 1. 
Srivastava and Du demonstrated that their test for problem (pQ) has superior asymptotic power to 
the tests of BS and CQ under a particular parameter setting and local alternative when n = 0{p). 
To the best of our knowledge, the procedures of CQ and SD represent the state-of-the-art among 
tests for problem ([T]) 1 with a known asymptotic power function under the scaling (p,n) — > oo. 

In this paper, we propose a new testing procedure for problem (pQ) in the high-dimensional set- 
ting, which involves randomly projecting the p-dimensional samples into a space of lower dimension 
k < min{n,p}, computing the Hotelling test statistic in M. k , and then averaging over the ensemble 
of projection matrices. Working within a high-dimensional framework that allows p/n to tend to a 
positive constant or infinity, we derive the asymptotic power function for our test, and show that 
under certain conditions, it can outperform the tests of BS, CQ, and SD in terms of asymptotic 
relative efficiency. 

From a conceptual point of view, the procedure studied here differs from past work in the 
way that covariance structure is incorporated into the test statistic. The previously described test 
statistics of BS, CQ, and SD are essentially based on versions of the Hotelling T 2 test using diagonal 
estimators of S. Our analysis and simulations show that this limited estimation of S sacrifices power 
when the data variables are correlated, or when most of the variance can be captured in a small 
number of variables. In this regard, our procedure is motivated by the idea that covariance structure 
may be used more effectively by averaging over a projected version of the pooled sample covariance 
matrix. We note that the use of projection-based approaches to two-sample testing and covariance 
estimation have also been considered previously by Jacob et al. [lol j. as well as Clemengon et al. {{J, 
Cuesta-Albertos et al. [Hi], and Marzetta et al. [12J. 

The remainder of this paper is organized as follows. In Section El we discuss the intuition for 

1 The tests of BS, CQ, and SD actually extend somewhat beyond JT]) in that their asymptotic power functions 
have been obtained under data-generating distributions more general than Gaussian, e.g. satisfying simple moment 
conditions. The analysis of CQ also allows for the data-generating distributions to have unequal covariance matrices. 
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our testing procedure, and then formally define the test statistic. Section [3] is devoted to a number 
of theoretical results about the performance of the test. Theorems [U and [2] in Section [3] characterize 
the asymptotic distribution of our test statistic under the null and alternative hypotheses, and 
formulas for critical values and asymptotic power are provided. Theorems [3] and H] in Sections 13.41 
and 13.51 give sufficient conditions for achieving greater power than the tests of CQ and SD in the 
sense of asymptotic relative efficiency. In Section HJ we use simulated data to make ROC curve 
comparisons with the mentioned tests, as well as some recent non-parametric procedures such as 
maximum mean discrepancy (MMD) 13], kernel Fisher discriminant analysis (KFDA) [14], and 
a test based on area-under-curve maximization, denoted TreeRank [9(. These simulations show 
that our test outperforms competing tests in the parameter regimes anticipated by our theoretical 
results. 



Notation. Let 6 := fix — [ii denote the shift vector between the distributions iV(//i,X) and 
iV(/i2,5]), and define the ordered pair of parameters 9 := (<5, S). For a positive-definite covariance 
matrix S, let D a be the diagonal matrix obtained by setting the off-diagonal entries of S to 0, and 
define the associated correlation matrix R := D a 1 ^ 2 T l D (T 1 ^ 2 . Let z\- a denote the I — a quantile 
of the standard normal distribution, and let <3> be its cumulative distribution function. If A is a 
matrix in M pxp , let |||^4||| op denote its operator norm (maximum singular value), and define the 

Frobenius norm |||j4|||^ := \J^2ij~A^j- When all the eigenvalues of A are real, we denote them by 
Amin(^) = \{A) < • • • < Ai(A) = A max (A). We use the notation W m to denote an m x m white 
Wishart matrix, i.e. W m ~ W m (n, I m xm)- The notation f(n) < g(n) or f(n) = 0(g(n)) means 
that there is some absolute constant c 6 (0,oo) such that the inequality f{n) < cg(n) holds for 
all large n. If both /(n) < g(n) and g(n) < f(n) hold, then we write f(n) x g(n). The notation 

f(n) = o(g(n)) means f(n)/g(n) — > as n — > oo. The symbols = and A refer to equality and 
convergence in distribution. 



2 Random projection method 

For the remainder of the paper, we retain the setup for the two-sample test of means (JTJ) with 
Gaussian data given in Section 1. In particular, our procedure can be implemented with p > n 
or p < n, as long as k is chosen such that k < mm{n,p}, with n := ri\ + ri2 — 2. This bound on 
k will be assumed without comment, as well as the condition k — > oo as (n,p) — > oo. In Section 
13.31 we demonstrate an optimality property of the choice k = [n /2j , which is valid in moderate 
or high-dimensions, i.e., p > [n/2\, and we restrict our attention to this case in our comparison 
results (Theorems [3] and [4]) . 

2.1 Intuition for random projection method 

To provide some intuition for our method, it is possible to consider the problem ([T]) in terms 
of a competition between the dimension p, and the "statistical distance" separating Ho and Hi. 
On one hand, the accumulation of variance from a large number of variables makes it difficult 
to discriminate between the hypotheses, and thus, it is desirable to reduce the dimension of the 
data. On the other hand, methods for reducing dimension also tend to bring Ho and Hi "closer 
together" , making them harder to distinguish. Mindful of the fact that the Hotelling T 2 measures 
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the separation of Ho and Hi in terms of the Kullback-Leibler (KL) divergence la, p. 216, eq. 12] 
Dkl(N(/j,i, T l )\\N(fj,2, £)) = ^<5 T £ _1 <5, with 8 = fii — ^, we see that the relevant statistical distance 
is driven by the length of 5. Consequently, we seek to transform the data in a way that reduces 
dimension and preserves most of the length of 5 upon passing to the transformed distributions. 
From this geometric point of view, it is natural to exploit the fact that random projections can 



simultaneously reduce dimension and approximately preserve length with high probability 161 ] . 

If a projection matrix P^ G W kxp is used to project data from M. p to and the classical 
Hotelling T 2 statistic is constructed from the projected data, then the resulting statistic is propor- 
tional to (X - y) T [Pfc(P fc r EP fc )" 1 P fc r ](X - Y). In particular, if k < mm{n,p} and P k is drawn 
from the Haar distribution on the Steifel manifold V Pi /% := {P& G W xk : PjJ Pk = Ikxk}> then 
the matrix P^YiPk is invertible with probability 1, which ensures that this is a well-defined statis- 
tic. Hence, our consideration of the distance-preserving property of random projections naturally 
leads to using the matrix P^{P^ TjP^)" 1 P^ as a surrogate for in the high-dimensional setting. 
Moreover, to eliminate the variability of a single random projection, this idea can be refined by 
computing the average of the matrix P fe (P fc T SP A .)- 1 P fe T over the ensemble P& , to any desired degree 
of precision. The resulting statistic is proportional to (X - Y) r E Pk [P k (P k r T,P k )- 1 P k T ](X - Y), 
and this quantity will be the subject of the remainder of the paper. 2 The key advantage of this 
approach is that the matrix Ep fc [P/ c (P fc r SPfc) _1 P fc r ], retains some information about the off-diagonal 
entries of S, whereas a diagonal estimate makes no essential use of correlation structure. Indeed, 
this intuition is confirmed by our theoretical results, which quantify the relationship between the 
degree of correlation and the power that is gained over statistics based on diagonal estimates of S. 

2.2 Implementation of testing procedure 

For an integer k G {1, . . . ,min{n,p}}, let "K, k denote the Haar distribution on the set of matrices 
{Pk G M. pxk : Pj Pk = Ikxk}- If Pfc is drawn from 'fp^ki independently of the data, then our random 
projection-based test statistic is defined by 

fl := (X - Y) T E Pk [PkiP^Pkr'PlKX - Y). (3) 



For a desired nominal level a G (0, 1), our testing procedure rejects the null hypothesis Ho if and 

n 



only if T? > t a , where 



y n := k/n, and is the 1 — a quantile of the standard normal distribution. The fact that t a 
is an asymptotically level-a critical value follows from Theorem Q] of Section [3l where it is shown 
that Ti is asymptotically normal under Ho- 

In order implement our testing procedure, the user must choose a value for k, as well as a method 
for computing Kp k [Pk(P k T Sp c ) _1 P fe r ]. Our analysis in Section [331 shows that the choice k = \n/2\ 
possesses an optimality property under certain conditions, and our simulation results confirm that 
this is a reasonable default choice. Nevertheless, p-values can be obtained with any choice of 
k G {1,... ,mm{n,p}} using our asymptotic theory. Concerning the computation of the matrix 
EpjP fc (P fc T EP k )~ 1 P fc r ], it is a basic fact that if G G W xk is a random matrix with i.i.d. N(0, 1) 
entries, and if G = QR is a thin QR decomposition [17, p. 230], then Q is distributed according 



The paper also considers the matrix Ep fc [Pk(P^ 1 Pk] , from a different point of view. 
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to ip : k, and satisfies the algebraic identity G(G T Y>G)~ 1 G T = Q(Q T T I Q)~ 1 Q T (see Lemma H] in 
Section TB. II for additional details). Consequently, T? may be computed as an average over matrices 
with i.i.d. N(0, 1) entries, which are computationally inexpensive to generate. However, our analysis 
will make use of the distribution ip t k- 

A second issue is the the number of independent copies of to use when averaging, and 
this choice will depend on the data set at hand, as well as the degree of precision desired in the 
calculation of p-values. To eliminate the effects of finite averaging in the calculation of p-values, 
it is possible to examine the fluctuation of p-values when using a fixed number of projections N, 
and then increase N until the fluctuations become negligible. The stabilization of p-values with 
increasing values of N is illustrated in Figure [5] of Section [H Furthermore, our simulations in 
Figures [3] and H] of Section 0] show that in many cases, the ROC curve of the T? statistic stabilizes 
after averaging over only 30 projections. 

3 Main results and their consequences 

Our first two main results in Theorems Q] and [2] characterize the critical values and and asymptotic 
power function for our statistic T?, and they are proved in Appendices [B] and O Theorems [3] and H] 
to follow provide comparisons of asymptotic relative efficiency with tests proposed in past work, 
and they are proved in Sections 13.41 and 13.51 

3.1 Critical values and asymptotic power function 

As is standard in high-dimensional asymptotics, we consider a sequence of hypothesis testing prob- 
lems indexed by n, allowing the dimension p, sample sizes n\ and 712, mean vectors \i\ and [i<i and 
covariance matrix £ to implicitly vary as functions of n. Although we assume that y n := k/n tends 
to a limit in the interval (0, 1), the derivation of our critical values and asymptotic power function 
does not restrict p/n to tend to a finite limit. 

(Al) There is a constant y € (0, 1) such that y n = y + o(-^). 

To state our result on the asymptotic normality of Tj* under the null hypothesis, we define the 
numerical sequences p, n := jr^-n and a n := yj ^j*™ - ~ $ y/n. 

Theorem 1 (Critical values). Assume that the null hypothesis Ho and condition (Al) hold. Then 
as (n,p) — > oo we have the the limit 

Ttl^ 4 jv(0,1), (4) 

and consequently, the critical value t a satisfies 

P(3jfe > t a ) =a + o(l). 

To consider the alternative hypothesis Hi and obtain an asymptotic power function, we will 
assume that the ratio n\jn tends to a limit in condition (A2), and that a local alternative holds 
in condition (A3). This is the same as the local alternative used by Bai and Saranadasa to 
study the asymptotic power of the classical Hotelling test under (n,p) —> oo. The condition (A3) 
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carries the meaning that the KL divergence between the p-dimensional sampling distributions, 
Dkl(N(ui, S) II N(n2,T,)) = !<5 T £ _1 <5, tends to 0. Similar local alternatives are used elsewhere in 
[1, @, Qj S] with consideration to the high-dimensional problem ([1]) . 

(A2) There is a constant 6 £ (0, 1) such that ^ = 6 + o(^). 

(A3) (Local alternative.) The shift vector and covariance matrix satisfy S T T,~ 1 5 = o(l). 

To set some notation for our asymptotic power result in Theorem O let 9 := (5, T,) denote 
an ordered pair containing the relevant parameters for problem ([T]), and define (3(9) as the exact 
(non-asymptotic) power function of the test T| at level a. We also define A/% as twice the Kullback- 
Leibler divergence between the projected sampling distributions, 



A k := 2 D KL {N(P k ] Ml , P, 1 £P0 || N(P k > /x 2 , P k ] £P fc )J = <5 1 P fc (P fc ' ZP k )~ l P k l 5. (5) 

Note that this quantity is a random variable, due to its dependence on the random matrix P^ . By 
taking an average over P k , we obtain the deterministic quantity 

A fc := E Pk [A fe ] , 

which determines the asymptotic power of T?, according to the following theorem. 

Theorem 2 (Asymptotic power function). Assume conditions (AX), (A2), and (A3). Then, as 
(n,p) —7- oo, the power function (3(9) satisfies 

p(9) = $ (-z X - a + 6(1 - b)^^A k v^) + o(l). (6) 

Remarks. Notice that if A^ = (e.g. under Ho), then <J>(— z\- a + 0) = a, which corresponds 

to blind guessing at level a. Consequently, the second term 6(1 — 6)^4^ A k ^/n determines the 

advantage of our procedure over blind guessing. Since A^ is twice the KL divergence between the 
projected sampling distributions, these observations conform to the intuition from Section 2 that 
the KL divergence measures the discrepancy between Ho and Hi. 

3.2 Asymptotic relative efficiency (ARE) 

Having obtained an asymptotic power function in Theorem [21 we are now in position to provide a 
detailed comparison with the tests of Chen and Qin Q and Srivastava and Du @, 0] • We denote 
the asymptotic power function of our level-a random projection-based test (RP) by 



0RP,*(0) := $ (-Zl-a + 6(1 - b)^-^A k Vn~^. (7) 
The asymptotic power functions for the level-a testing procedures of CQ and SD 0, 0] are given 

by 

fa#):= *(--^+ t V )l p!j; )- and (8a> 

ftnOT :=*(-*-„ + ^^fc), (8b) 
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where D a denotes the matrix formed by setting the off-diagonal entries of S to 0, and R denotes the 
correlation matrix associated to X. The functions /3cq and /3sd are derived under local alternatives 
and asymptotic assumptions that are similar to the ones used here to obtain /Srp^. In particular, 
all three functions can be obtained allowing p/n to tend to an arbitrary positive constant, or to 
infinity. 

A standard method of comparing asymptotic power functions is through the concept of asymp- 
totic relative efficiency, or ARE for short (e.g., see van der Vaart [181 . ch. 14-15]). Since the term 
added to —Zi- a inside the $ function is what controls power, the relative efficiency of tests is 
defined by the ratio of such terms. More explicitly, we define 



With these definitions, whenever the ARE is less than 1, the method proposed in this paper has 
greater asymptotic power than the competing test — with the advantage being greater for smaller 
values of the ARE. As discussed at more detail later, Theorems [3] and H] provide conditions on the 
shift 5 and covariance matrix £ under which the ARE is small. 

For arbitrary values of the parameters £ and 5, the ARE formulas ([9]) are not easy to interpret. 
In order to gain insight into the conditions that determine the relative performance of the RP, CQ, 
and SD tests, we will consider two natural cases where the ARE can be reduced to an interpretable 
form. Since ARE (/?cq//3rp,A;) and ARE (/3sd//3rp,/c) are invariant with respect to scaling of 5, 
the orientation <V||5||2 is the only part of the shift vector that is relevant for comparing power. 
Likewise, it is natural to consider the extreme cases of a "uniform" orientation and an "adversarial" 
orientation. To encode the idea of a uniform shift, for which no direction is of particular importance, 
we will allow ct/ll^lb to follow the uniform (Haar) distribution on the unit sphere of MP. A similar 
assumption was considered by Srivastava and Du 0], who let 5 be a deterministic vector with all 
coordinates equal to the same value, in order to compare with the results of BS To encode the 
idea of an adversarial shift, we will consider a situation in which 5 is likely to point in a direction 
of high variance, i.e. a direction in which the shift is easily explained away by chance variation. 
One way this can be carried out is to allow 5 to follow the distribution iV(0, for some scaling 
factor s n , since the most likely direction is parallel to the eigenvector corresponding to the largest 
eigenvalue of E, and this will be referred to as a tilted shift. We emphasize that our testing procedure 
does not rely on these choices of the distribution of 5, and that our asymptotic power function in 
Theorem [2] is derived under the assumption of fixed and unknown values of 5 and S. 

In classical analyses of asymptotic relative efficiency, the ARE is usually a deterministic quan- 
tity that does not depend on n. However, in the current context, our use of high-dimensional 
asymptotics, as well as a randomly chosen value of 5, lead to an ARE that can be regarded as 
a sequence of random variables indexed by n. To be clear about the meaning of Propositions Q] 
and and Theorems [3] and H] below, we point out that d is the only source of randomness in the 
ARE. We complete the preparation for our comparison theorems by stating Propositions Q] and [21 
which describe the scaling of the power-determining quantity A/% under a uniform shift and a tilted 
shift. These are the main technical tools for our comparison results, and proofs can be found in 
Appendix [Aj Although these scaling results do require some assumptions on the eigenvalues of S, 
these assumptions essentially just amount to requiring that the maximum and minimum eigenval- 




and 



(9b) 



(9a) 
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ues do not converge to or oo too quickly. In particular, we do not require the eigenvalues to be 
bounded away from or oo. 

Proposition 1 (Scaling of with a uniform shift). 

Assume that the shift 5 has a spherical distribution Fg with F$(5 = 0) = 0. Then, the follow- 
ing limit statements hold with (n,p) —> oo. 

ft) V a ■ 7s) k = °(X)> then for any c 6 (0, 1), we have 

m**kh h (10) 

(ii) IfJ2 = Ipxp + r for some matrix T ^ with rank(r) = o(k), then 

,, / > 1 in probability under F '$ . (11) 

Remarks. The case £ = I pX p + T includes a variety of sparsity patterns of the matrix S. For 
instance, if we choose v to be a vector with support set S C {1, . . . ,p}, then using T = vv T gives a 
matrix S whose off-diagonal entries are supported on S x S. By choosing T to be a sum of rank 1 
matrices of this type, more complex block-correlation patterns can be captured. 

Proposition 2 (Scaling of A^ with a tilted shift). 

Assume that the shift 5 has a distribution P5 = N(0, s n S) for some positive scaling factor s n , 
and that |||S||| op /tr(S) = o(l). Then, as (n,p) —> 00, we have 

Afc / k 

1 in probability under P5. (12) 



[2 



tr(S) 



3.3 Choice of projection dimension k — [n/2\ 

We now demonstrate an optimality property of the choice of projected dimension k = \n/2\. 
Letting k/n —> y £ (0, 1), recall that the asymptotic power function from Theorem [2] is 

/WW = * (-zi-a + b(l - b)^^A k ^ . 

Since Propositions [1] and [2] indicate that scales linearly in k up to random fluctuations, we see 

that formally replacing k with yn leads to maximizing the function f(y) := yj 4^ y. The fact that 

/ is maximized at y = 1/2 suggests that k = [n/2\ may be optimal in certain cases. The following 
proposition formalizes this idea by providing conditions under which this choice is optimal in the 
sense of ARE. For this purpose, we define k* := [n/2\, y* := 1/2, and 

A 2 

ARE(/3 RPife //3 RPifc *) := 2 \ . 

1 y a 

2y* A fc* 
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Note that Propositions[T]and[2]provide two classes of examples where the conditions of the following 
result are satisfied (along subsequences). 

Proposition 3. Assume that either of the limits imfe/r-fer -> 1 or ifrnr/- — > 1, hold for some 

J \\S\\i' tr O) 11*112' p 

sequence of parameters 5 and E. Then, for any choice of k with k/n — >• y € (0,1), we /lave i/te 
optimality property 

lim ARE(/3 RP;fc //3 RP;fc *) < 1. 

(n,p)— >oo 

Proof. Under either of the limits, it is straightforward to check that for all y G (0, 1), the numerical 
sequence ARE(/3 R p j fc//3 R p j fc*) tends to 4y(l — y), which is at most 1. □ 

Remarks. The ROC curves in Figure [1] illustrate several choices of projection dimension, with 
k = [_yn\ and y = 0.1, 0.3, 0.5, 0.7, 0.9, under two different parameter settings. We refer to these 
as Setting 5 and 8, as they are described in Section ETTI where full details are provided. In all cases, 
the statistic T| is computed by averaging over 100 projections. 

In Setting 8, the matrix E is constructed with a slowly decaying spectrum, and a matrix of 
eigenvectors drawn from the uniform (Haar) distribution on the orthogonal group (to induce a non- 
diagonal E). Also, the shift vector 5 is drawn from N(0, s n E) with s n = \J tr(E) /2, in accordance 
with the conditions of Proposition [2j As expected from Proposition [2l we see that the ROC curve 
corresponding to y = 1/2 dominates the others up to small fluctuations. 

In Setting 5, the matrix E is constructed to have a block-diagonal form, and the shift vector 
5 is drawn from the uniform distribution on the unit sphere, which results in a set of conditions 
similar to those of Proposition [TJ(ii)]. The ROC curves in Figure [T] show that y = 1/2 is not far 
from optimal among the five choices of y. 




False positive rate False positive rate 

(a) Setting 8 (b) Setting 5 

Figure 1. Setting 5 corresponds to a randomly generated covariance matrix with a slowly decaying 
spectrum, and a shift vector drawn according to N(0, s E) with s = yftr(S)/2. Setting 8 corresponds 
to a block-diagonal covariance matrix with a shift vector drawn from the uniform distribution on the 
unit sphere. We see that the choice k — \_n/2\, corresponding to y = 1/2, is nearly optimal in both 
cases. Full details can be found in Section B~T1 
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3.4 Power comparison with CQ 



We are now equipped to compare the asymptotic power of our test with the test of Chen and Qin 
(CQ) 0] in the following theorem. A proof is given at the end of this section 



Theorem 3. Assume the conditions of either Proposition\l][(i)] (uniform shift) or Proposition® 
(tilted shift), and choose k = \n/2\. Fix a number e\ > 0, and let Ci(ei) be any constant strictly 
greater than 4/ei. Then, as long as the inequality 

n> Cl ( ei )|^£, (13) 
III^IIIf 

holds for all large n, we have the limit P<j [ARE (/?cq//3rp fe) < ei] — > 1 as (n,p) —> oo. 

Remarks. The case of ei = 1 serves as the reference for equal asymptotic performance, with 
values e% < 1 corresponding to the T 2 statistic being asymptotically more powerful than the test of 
CQ. To interpret the result, note that Jensen's inequality implies that the ratio tr(E) 2 / |||S|||^ lies 
between 1 and p, for any choice of S. As such, it is reasonable to interpret this ratio as a measure 
of the effective dimension of the covariance structure (c.f. Examples [land [2] below). 3 The message 
of Theorem [3] is that as long as the sample size n grows faster than the effective dimension (with 5 
being uniformly oriented), then our projection-based test is asymptotically superior to the test of 
CQ. 

The ratio tr(£) 2 / |||£|||^ can also be viewed as measuring the decay rate of the spectrum of S, 
with the condition tr(E) 2 / ||| ||| ^ <C p indicating rapid decay. This condition means that the data 
has low variance in "most" directions in MP, and so projecting onto a random set of k directions 
will likely map the data into a low-variance subspace in which it is harder for chance variation to 
explain away the correct hypothesis, thereby resulting in greater power. 

Example 1. One instance of spectrum decay occurs when the top, say s, eigenvalues of E contain 
most of the mass in the spectrum. When £ is diagonal, this has the interpretation that s variables 
capture most of the total variance in the data. For simplicity, assume Ai = • • • = A s > 1 and 
A s +i = • • • = A p = 1, which is similar to the spiked covariance model introduced by Johnstone [20]. 
If the top s eigenvalues contain half of the total mass of the spectrum, then s Ai = (p — s), and a 
simple calculation shows that 



tr(£) 2 _ 4 A 2 



II 2 

II F 



A? + A; 



s < As. (14) 



This illustrates the idea that condition (|13p is satisfied as long as n grows at a faster rate than 
the effective number of variables s. It is straightforward to check that this example satisfies the 
conditions of (A5) of Theorem [3] when, for instance, Ai = o(k). 

Example 2. Another example of spectrum decay can be specified by Aj(£) oc i~ u , for some absolute 
proportionality constant, a rate parameter v G (0, oo), and i = 1, . . . ,p. This type of decay occurs 
in various sparse signal models, and arises in connection with the Fourier coefficients of functions 



3 This ratio has also been studied as an effective measure of matrix rank in the context of low-rank matrix 
reconstruction [l9| . 
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in Sobolev ellipsoids 21] [22k 7. 2]. Noting that tr(S) x JJ 1 ! v dx and |||S|||^ x fj'x 2u dx, direct 
computation of the integrals shows that 



tr(£) 2 



1 

log 2 p 



if v> 1 
if u = 1 



p2(i-^) ifzyG( i )i; 
p/logp \iv = \ 
[p if f6(0,|; 



Thus, a decay rate given by z/ > 1 is easily sufficient for condition (|13p to hold unless the dimension 
grows exponentially with n. On the other hand, decay rates associated to v < 1/2 are too slow for 
condition (fT3]) to hold when n <C p, and rates corresponding to v G (5, 1) lead to a more nuanced 
competition between p and n. We also point out that the assumptions of Theorem [3] hold for all 
v G (0, 1], when the dimension p must satisfies logp = o(fe). 



The proof of Theorem [3] is a direct application of Propositions [T] and EJ 

2 



Proo/ 0/ Theorem^ Recalling ARE (/3cQ//3RP,fc) 
the event of interest, 



n||5||! 



ARE^cq/fe) <ei, 



with = [n/2\ and y = 1/2, 

(15) 



is the same as 



n 1 



A, 



By Propositions [T] or [21 we know that for any c < 1, the probability of the event 



c k At 
< — ^ 



tr(S) 



(16) 



tends to 1 as (n,p) — > 00. Consequently, as long as the inequality 

2 



1 



< 



ck 



(17) 



holds for all large n, then the event (|15p of interest will also have probability tending to 1. Replacing 
k with S • [1 — o(l)], the last condition is the same as 



n > 



tr(S) 2 



eic 2 [l-o(l)] 2 ' 



(18) 
□ 



3.5 Power comparison with SD 

We now turn to a comparison of the asymptotic power of our test with that of Srivastava and Du 
(SD) @, 0]. Recall that R denotes the correlation matrix. 
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Theorem 4. Fix a number e\ > 0, and let ci(e\) be any constant strictly greater than j-. 

(i) (Uniform shift). Assume ||| Z^^T 1 1|| o ^ / tr(.D~ 1 ) = o(l), and that the conditions of Proposi- 
tion{^[(i)J hold. Then, as long as the inequality 

, s /tr(E)\ 2 {ti(D~ l )\ 2 , , 

• !£ci(ei) (-n (w) (19> 



is satisfied for all large n, we have the limit P [ARE (/3sd//3rp fc) < e i] — 1 as { n ,P) ~^ °°- 

(it) (Tilted shift). Assume %R\ op / tr(R) = o(l), and that the conditions of Proposition^ hold. 
Then, as long as the inequality 

"^''(S) 2 (20> 

is satisfied for all large n, we have the limit P [ARE (/?SD/AiP,fc) < ei] — > 1 as (n,p) °°- 

Remarks. Unlike the comparison with the CQ test, the correlation matrix R plays a large role in 
determining the relative efficiency between our procedure and the SD test. The correlation matrix 
enters in two different ways. First, the Frobenius norm |||i2|||jr is larger when the data variables 
are more correlated. Second, if S has a large number of small eigenvalues, then tr(D~ 1 ) is very 
large when the variables are uncorrelated, i.e. when E is diagonal. Letting UAU T be a spectral 
decomposition of E, with Ui being the zth column of U T , note that (D a )a = ujAui. When the 
data variables are correlated, the vector u\ will typically have many nonzero components, which 
will give (D a )a a contribution from some of the larger eigenvalues of E, and prevent (D a )a from 
being too small. For example, if Ui is uniformly distributed on the unit sphere, as in Example H] 
below, then on average E[(D CT )jj] = tr(E)/p. Therefore, correlation has the effect of mitigating the 
growth of tr(D~ 1 ). Since the SD test statistic [g] can be thought of as a version of the Hotelling T 2 
with a diagonal estimator of E, the SD test statistic makes no essential use of correlation structure. 
By contrast, our T 2 statistic does take correlation into account, and so it is understandable that 
correlated data enhance the performance of our test relative to SD. 

Example 3. Suppose the correlation matrix R £ M pxp has a block-diagonal structure, with m 
identical blocks B G M. dxd along the diagonal: 



R 



( B 



\ B J 



(21) 



Note that p = m ■ d. Fix a number p G (—1,1), and let B have diagonal entries equal to 1, 
and off-diagonal entries equal to p, i.e. B = (1 — p)Idxd + / f 'll T ) where 1 G M rf is the all-ones 
vector. Consequently, R is positive-definite, and we may consider E = R for simplicity. Since 
III-^IIIf = d + Zp 2 ^), and |||-R|||^ = m lll-B|||p'> it follows that 

\\\R\f F = [l + P 2 (d-l)]p. 

Also, in this example we have tr(E) = tr(D~ 1 ) = p and p/d = m, which implies 

tr(E)V ^(D- 1 )^ 2 _ p < m_ (22) 



; " l + p2(d-l) " p 2 ' 
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Similarly, 



tr(12) 



m 



Consequently, we conclude that the sufficient conditions (|19p and (|20p from Theorem [J] are satisfied 
when n grows at a faster rate than the number of blocks m. Note too that the spectrum of E 
consists of m copies of A max (E) = (1 — p) + pd and (p — m) copies of A m i n (E) = 1 — p, which 
means when p is not too small, the number of blocks is the same as the number of dominant 
eigenvalues — revealing a parallel with Example [TJ From these observations, it is straightforward to 
check that this example satisfies the assumptions of Theorem |U The simulations in Section 14.11 give 
an example where R has the form (|2ip . and the variables corresponding to each block are highly 
correlated. 

Example 4. To consider the performance of our test in a case where E is not constructed deter- 
ministically, Section 14.11 illustrates simulations involving randomly selected matrices E for which 
the statistic T| is more powerful than the tests of BS, CQ, and SD. Random correlation structure 
can be generated by sampling the matrix of eigenvectors of E from the uniform (Haar) distribution 
on the orthogonal group, and then imposing various decay constraints on the eigenvalues of E. 
Additional details are provided in Section [4.11 



Example 5. It is possible to show that the sufficient conditions (|19p and (|20p require non-trivial 
correlation in the high-dimensional setting. To see this, consider an example where the data are 
completely free of correlation, i.e., where R = I pX p- Then 



F = y/p, and Jensen's inequality 
tT \Ri — Altogether, this 



implies that tr(Z)- 1 ) > p 2 /ti(D a ) = p 2 /tx( s )> giving 

shows if the data exhibit very low correlation, then condition (|19p cannot hold when p grows faster 
than n. This is confirmed by the simulations of Section [4.11 involving diagonal covariance matrices. 

We now turn to the proof of Theorem HI which makes use of the following concentration bounds 
for Gaussian quadratic forms. 



Lemma 1. Let A El 

Then, for any t > 0, 



and for any t G (0, 



' xp be a positive semidefinite matrix with 
Z T AZ 



I -A III op > 0, and let Z 



N(0,I } 



pxp) 



tr(A) 



> 1 + t 



m\o P 



< exp (-r/2) 



(23) 



tr(A) 



l), we have 



Z T AZ 
tr(A) 



< 



M\o P 

tr(A) 



III All op 
tr(A) 



See Appendix [X] for a short proof of this result 
papers of Bechar and Laurent and Massart [2 



24] 



< exp(-t 2 / 2 )- 



(24) 



These bounds are similar to results in the 
but have error terms involving the operator 



norm as opposed to the Frobenius norm, and hence may be of independent interest. 



Equipped with this lemma, we can now prove Theorem [H 
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Proof of Theorem^ We only give the proof under the conditions of Proposition [21 since the proof 
is essentially the same under the conditions of Proposition H(i)]. Let us define the event of interest 

g n := {ARE(/3 SD //3 RP>fe ) < £l }, where we recall ARE (AWfe) = (^/m)', with 
k = \n/2\ and y = 1/2. The event S n holds if and only if 



— x 'I , nc„ 2 

We consider the two factors on the right hand side of (I25p separately. By Proposition [2j the first 
factor satisfies 

11*112 

TlTiT^ / I 1 in probability under F 5 . (26) 

11*112/ tr ( S ) 



\W 

with Z ~ N(0,I pxp ). Next, using Lemma[H and tr(E 1 / 2 L'~ 1 E 1 / 2 ) = tr(R), we have 



Since the second factor ^ J-i^ m nne dZSJ) is invariant under scaling of 5, we may write 5 = YV 2 Z 



2 ' tr(S) 
8 T D- 1 5/ tr(fl) 
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1 in probability under P5, (27) 



since ^L 2 ^ • = z t s i/1^ s i/2 Z ■ I7§§ ~^ 1 in probability under F s . Consequently, for any 
c G (0, 1), we combine (|26l) and (|27|) to conclude that P (# n ) — >• 1 as long as the inequality 

" '< c f±y»V (28) 



flf.ei - C \tT{K)J \tr(H) 



\F 

holds for all large n. Replacing k with ^ • [1 — o(l)], the last condition is equivalent to 

\\\\ r \\\fJ eic[l-o(l)\ 2 

□ 

4 Comparisons with simulated data 

In this section, we compare our procedure to a broad collection of competing methods on simulated 
data, illustrating the effects of the different factors involved in Theorems [3] and HI 



4.1 Description of models 

We generated ROC curves using mulitvariate normal data, sampled under 10 different choices of 
the parameters S and E. For each ROC curve, we sampled n\ = n<i = 50 data points from each 
of the distributions iV(//i,E) and A r (/ii2,E) in p = 200 dimensions, and repeated the process 500 
times with 5 = fi± — ^2 = under Ho, and 500 times with 5^0 under Hi. Hence, each ROC curve 
reflects the results of 1000 two-sample problems. We organize the 10 parameter settings according 
to 5 different choices of E, and 2 different choices of 5, as displayed in the following table. 
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Setting # 


Choice of £ 


Choice of S 




diagonal 


random 


slow decay 


fast decay 


block stucture 


uniform 


tilted 


1 


X 




X 






X 




2 


X 






X 




X 




3 




X 


X 






X 




4 




X 




X 




X 




5 










X 


X 




6 


X 




X 








X 


7 


X 






X 






X 


8 




X 


X 








X 


9 




X 




X 






X 


10 










X 




X 



For each two-sample problem under a uniform shift, the vector 5 was drawn independently 
as Zj\Z\i with Z ~ N(0, I pxp ), and for each problem under a tilted shift, the vector 5 was 
drawn independently as 2 E 1 / 2 Z'/||E 1//2 Z||2, so as to conform to the conditions in Theorems [3] and 
[D Similarly, for cases where £ was drawn randomly, an independent copy was drawn for each 
two-sample problem, as described below. 

To describe the 5 choices of £, let UAU T denote a spectral decomposition of E. A diagonal 
choice of £ corresponds to U = I pX p, and a random choice of £ corresponds to drawing U from the 



uniform (Haar) distribution on the orthogonal group (see the paper 25J] for sampling details). Note 
that U = Ipxp results in independent variables, whereas a random U leads to correlated variables. 
To consider two rates of spectral decay, which we refer to as slow and fast, we selected p equally 
spaced eigenvalues Ai, . . . , A p between 0.01 and 1, and raised them to the power 15 for fast decay, 
and the power 6 for slow decay. We then added 0.001 to each eigenvalue to control the condition 



number of £, and rescaled them so that |||£|||^ = y A 2 + • • • + A 2 , = 50 (fixing a common amount of 
variance in both the slow and fast cases). Plots of the resulting spectra are shown in Figure [2j The 
last setting involves a sparse matrix £ with 40 small groups of highly correlated variables, and we 
refer to this as the 6/ocfc-structured case. This case does not depend on the previously described 
choices of U or A. Specifically, the block-structured choice of £ was constructed using 40 identical 
blocks B G M 5x5 along the diagonal, with the diagonal entries of B equal to 1, and the off-diagonal 
entries of B equal to p := 0.8 (c.f. Example [3]). 

4.2 ROC curve comparisons 

In addition to our random projection (RP)-based test, we implemented the methods of BS [5], SD 
0], and CQ 0], which are all designed specifically for problem (fTJ) in the high-dimensional setting. 
For the sake of completeness, we also show comparisons with two recent non-parametric procedures 



that are based on kernel methods: maximum mean discrepancy (MMD) [13j], and kernel Fisher 
discriminant analysis (KFDA) [l^], as well as a test based on area-under-curve maximization, 
denoted TreeRank Note that the curves labeled RP-1, RP-30, and RP-100 correspond to 
computing the statistic T| by averaging over 1, 30, and 100 matrices Pk with i.i.d. N(0, 1) entries. 

Overall, the ROC curves in Figures [3] and H] show that the RP and SD tests perform the best 
within this collection of procedures, with RP test having an advantage for correlated variables 



15 



o 
o 
o 



o 




o fast decay 
+ slow decay 



^ i i i i 

O 50 100 150 200 

Index 

Figure 2. Plots of two sets of eigenvalues Ai,...,A p , with slow and fast decay, both satisfying 

\\Til\p — \J ^\ + • • • + Ap = 50. To interpret the number of non-negligible eigenvalues, there are 29 

eigenvalues greater than a tenth of A max (E) in the case of fast decay, and there are 65 eigenvalues 
greater than a tenth of A max (E) in the case of slow decay. 



and the SD test having an advantage for independent variables. Furthermore, the plots indicate 
that essentially no additional power is gained by averaging the T| statistic with more than 30 
projections. 

On a qualitative level, Figures [3] and H] reveal some striking differences between our procedure 
and the competing tests. Comparing independent variables versus correlated variables, i.e. panels 
(a) and (b), with panels (c) and (d) in both figures, we see that the tests of SD and TreeRank 
lose power in the presence of correlated data. Meanwhile, the ROC curve of our test is essentially 
unchanged when passing from independent variables to correlated variables. Similarly, our test also 
exhibits an advantage when the correlation structure is prescribed in a block-diagonal manner in 
panel (e) . The agreement of this effect with Theorem 2] is explained in the remarks and examples 
after that theorem. Comparing slow spectral decay versus fast spectral decay, i.e. panels (a) and 
(c) , with panels (b) and (d) , we see that the competing tests are essentially insensitive to the change 
in spectrum, whereas our test is able to take advantage of low-dimensional covariance structure. 
The remarks and examples of Theorem [3] give a theoretical justification for this observation. To 
offer a more quantitative assessment of the ROC curves in terms of Theorems [3] and HI Table 1 
below summarizes approximate values of the ARE-determining quantities in those results. 

The fluctuations of 100 p-values calculated from a single dataset by averaging over different 
numbers of projections is illustrated in Figure [5] below. The dataset was drawn according to 
Setting 4. In particular, we see that the order of magnitude of the p-values is stable when 100 
projections are used for averaging. When the number of projections is increased to 10 4 , all 100 
p-values lie strictly between 0.022 and 0.029. We remark that the T? statistic can be computed by 
averaging over 10 4 projections in about 30 seconds using R code on a basic laptop. 
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(b) Setting (2): diagonal E, fast decay 
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(c) Setting (3): random E, slow decay (d) Setting (4): random E, fast decay 




O ! ! ! ! 

0.0 0.2 0.4 0.6 0.8 1.0 

False positive rate 



(e) Setting (5): block covariance 

Figure 3. (Uniform shift) ROC curves of several test statistics for five different settings of E with 
the shift vector S drawn uniformly from the unit sphere: (1) Diagonal covariance / slow decay, (2) 
Diagonal covariance / fast decay, (3) Random covariance / slow decay, and (4) Random covariance 
/ fast decay, (5) Block covariance. 
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(e) Setting (10): block covariance 

Figure 4. (Tilted shift) ROC curves of several test statistics for five different settings of £ with the 
shift vector S drawn from -/V(0, s£) with s = y / tr(E)/2 : (6) Diagonal covariance / slow decay, (7) 
Diagonal covariance / fast decay, (8) Random covariance / slow decay, and (9) Random covariance 
/ fast decay, (10) Block covariance. 
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Quantity 


diagonal E, 


diagonal E, 


random E, 


random E, 


block 




slow decay 


fast decay 


slow decay 


fast decay 


structure 


tr(E) 2 /|||E||[ 2 F 


54 


25 


54 


25 


56 


\ V ! \ > 


4.6 x 10 5 


3.5 x 10 5 


58 


30 


56 


tr{Rf/\\Rf F 


200 


200 


55 


26 


56 



Table 1. Approximate values of the quantities tr(E) 2 / |||E||| 2 , (^) ( 'p^ ) > an d tr(R) 2 / \\R\f F , 
for the five choices of E in the simulated data experiments. For cases involving a randomly drawn 
E, the quantities are obtained as the rounded average of 500 draws. Theorems [3] and H] assert that 
these quantities determine the relative performance of our test with CQ and SD respectively. The 

value of f ^r^) (lpf]p) 1S ver ^ l &r S e m t ne case 01 diagonal E, since ti^D^ 1 ) involves reciprocals 
of many small eigenvalues. This effect disappears once correlation is introduced. 
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Figure 5. Using a single dataset drawn according to the conditions of Setting 4, we multiplied 
the shift S by a factor of 35 to bring the p-values into an interesting range. For each choice of 
N = 10 2 , 10 3 , 10 4 , we calculated 100 p-values with the statistic f| by averaging over N projections. 



5 Conclusion 

We have proposed a novel testing procedure for the two-sample test of means in high dimensions. 
This procedure can be implemented in a simple manner by averaging over projected versions of 
the classical Hotelling T 2 test. In addition to deriving the asymptotic power function for this test, 
we have provided interpretable conditions on the covariance and correlation matrices for achieving 
greater power than competing tests in the sense of asymptotic relative efficiency. Specifically, our 
theoretical comparisons show that our test is well-suited to interesting regimes where the data 
variables are correlated, or where most of the variance can be captured in a small number of 
variables. Furthermore, in the realistic case of (n,p) = (98,200), these types of conditions were 
shown to correspond to favorable performance of our test with several competitors in ROC curve 
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comparisons on simulated data. Extensions of this work may include more refined applications of 
random projection to other high-dimensional testing problems. 
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A Proofs of scaling results for (Propositions [I] and [2]) 

Before proceeding to the proofs of Propositions [1] and El we prove Lemma [TJ which was stated 

in the main text. We also state and prove Lemmas [2] and [31 concerning the eigenvalues of 
EpJP fc (P fe T S p fc) -l P T]. 



Proof of Lemma [TJ Note that the function f(Z) := V Z T AZ = \\A 1 I 2 Z\\2 has Lipschitz constant 
A\\\ with respect to the Euclidean norm on W. By the Cirel'son-Ibragimov-Sudakov inequality 



(30) 



for Lipschitz functions of Gaussian vectors [26J, we have for any s > 0, 

P[/(Z)<E[/(Z)]- a ]<exp(3j3£- 



From the Poincare inequality for Gaussian measures 27], the variance of f(Z) is bounded above 
as var[/(Z)] < |||^4||| op . Noting that E[/(Z) 2 ] = tr(^4), we see that the expectation of f(Z) is lower 
bounded as 

E[f(Z)} > ^tr(A) - \\\A\\\ op . 
Substituting this lower bound into the concentration inequality (j30|) yields 

< exp 



f(Z) < Jtr(A) 



op 



Finally, letting t G 



trM) 



1 ) , and choosing s 2 = t 2 |||^4||| op yields the claim ([2 



To obtain the second bound (|24p in Lemma[H we use the upper-tail version of the inequality ([30 



namely P [f(Z) > Ef(Z) + s] < exp 



2III.4I 



for s > 0. By Jensen's inequality, we have 



E[f(Z)] = EVZ T AZ < JE[Z T AZ] = y/ti(A), 



from which we obtain 
yields the claim (|23j) . 



f(Z) > 7tr^4) + s 



< exp 



2|A| 



and setting s 2 = t 1 



op 



for t > 
□ 



Lemma 2. For any positive definite matrix T, £ M pxp , we have the upper bound 

1 



E Pfe [P fc (P fc ' SP^P, 



< 
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Proof. Using Jensen's inequality, and the fact that the matrix P k (P k T T l Pk) 1 P k T has the same 
non-zero eigenvalues as the matrix (P^ T,P k )~ l P k T Pk = (P& SPfc) -1 , we have 



Ep fc [P fc (P fc ' £P fe )-\P ; 



op 



[Pk^pk] 



-1 



op 



(31) 



Since |||(P& SPfc) 1 ||| op is the reciprocal of A m i n (P fc r EP&), we use the variational characterization 



of the minimum eigenvalue to obtain the lower bound 

Aminl-Pfc 1 '^Pk 



inf x 1 P, 1 SP fc x 
ll*[|a=l 



> inf y T Sy • inf ||i\x||i 
IM|2=i IfI|2=i 

= A m i n (S) • 1. 



(32) 



□ 



Lemma 3. For any positive definite matrix S E R pxp , we have the lower bound 

tr(E P jP fe (pT£P fc )-ipT]) >k-^ y 

Proof. Combining the cyclic property of trace with the identity Pj P k = Ikxk yields 

trCEpJP^pTEPfc)-^]) = E P Jtr(P fc T SP fc )- 1 )]. 
Letting {Aj}^ =1 denote the eigenvalues of P^jEPfc, we then have 

fc fc i= i A * Li=i x i 

where we have applied Jensen's inequality to the function f(t) = 1/t, which is convex on the 
positive real line. Applying the same argument with the expectation Ep fc , we obtain 

itr(Ep fc [P fe (P fe T SP fc )- 1 pT])>EpJA ; /tr(P fc T SP fc )] > - ,p Tvp , r (33) 

lHp fe [tr^ fc l^f k )\ 

Let £ = UAU T be a spectral decomposition of E. Since Pt follows a unitarily invariant distribution, 
P^TSPfc =d P^jAPfc, and the ith diagonal entry of this matrix can be written as uj Aui, where U{ 
is the ith column of P&. Since Ui is uniformly distributed on the unit sphere of MP, the expected 
value of the square of any coordinate of U{ is equal to 1/p. Hence, TK[uJ Aiii] = tr(S)/p, and so 
Ep fc [tr(P fc T £P fc )] = fctr(E)/p. ' □ 

A.l Proof of Proposition [H 

Proof of part (i). We may represent <5/||<5||2 as Z/\\Z\\2 for a standard p-dimensional Gaussian 
vector Z. Let P = E Pfc [P k {P£Y,P k )- 1 P£] so that 

A fc Z T PZ 

|TFj|2 _(i 117112 • 



2 \\ Z \\2 
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Clearly, \Z\fejp — > 1 almost surely, and so it suffices to show that for any c < 1, 



Z ] BZ ck , 

> —=zr -> 1. 



P tr(S)_ 

To see this, first note that if the condition |||-B||| op / tr(f?) = o(l) holds, then Lemma [1] guarantees 

F (Z-BZ > c_ M B 1 \ i 

V p p ) 

Since Lemma [3] ensures that tr ^ > ^^j, it remains to verify |||-B||| op / tr(£?) = o(l). By LemmaH 
lll-^lllop — a anc ^ a S a i n i by Lemma O tr(£?) > k • ( tr fg\ )- Consequently, our assumption 

1 tr(£) i i \ ■ 
i r^n — ). — =o(l) gives 

Amin(S) k p \ i o 

ll|B|ll ^^U-^-(i). 



tr(B) " A min (£) • k p 

□ 



Proof of part (ii). Let r := rank(r). Since T ^ we may write V = V r V r , where V r G 

»T 

k 



and rank(V r ) = r. Let 5 r := PuV r € M kxk . By the Woodbury formula 



p^p k ) = (Pj:(i pxp +T)p k 

hxk+PkVrV r T P k 



(34) 

kxk — B T ( I kxk + i? r ' B r ) . 



Hence, 



A k = 8 T P k (I kxk -C r )P^6. 



= :G 

Based on the preceding calculations, it can be verified that for almost every P k , the matrix G has 
operator norm 1, and that at least k — r eigenvalues of G are equal to 1. Consequently, 

PpjG]||| op ^ i 



<t— : = o(l), 



tr(E P JG]) " k-r 
and then using the representation 5/||<5||2 =d Z/\\Z\\2, Lemma CD gives 

A fe /tr(E Pfc [G]) 



, 1 in probability under Fg. 
6\\p p 



To simplify tr(Ep fc [G]), observe that for almost all P k , the non-zero eigenvalues of C r are at most 
1, since they are of the form A/(l + A) where A is a non-zero eigenvalue of the positive semi-definite 
matrix Bj B r . Furthermore, because C r has rank r, we have tr(Ep fc [C r ]) = o(k), and so 

tr(E Pfc [G]) = E Pfe [tr(J fcxfe - C r )\ = k - E Pk [tv(C r )} = k • (1 - o(l)), 
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which implies that 

A fc Ik 



,, , >■ 1 in probability under Fg. 

2' P 



□ 



A. 2 Proof of Proposition [21 

It is sufficient to show that the limits A k /k — > 1 and tr(S) — > 1 hold in probability under F$. 

Since Afc/||<5||2 is invariant under scaling of 5, we may write 5 = Y}/ 2 Z with Z ~ N(0,I pxp ), and 
then Lemma [U immediately gives the second limit. To prove the first limit, we will show that A k /k 
converges to 1 in L 2 (Fs). Expanding out the definition of A k , we find 

A k = E Pk [Z T Y}l 2 P k (P^P k )- l P^Y}l 2 Z] . 

Notice that for any fixed realization of P k , the matrix M := S 1 / 2 Pfc(P fc r SPfc) _1 P fc r S 1//2 , is idempo- 
tent with rank k. Hence, conditional on P k , the variable Z T MZ has a x? distribution, which does 
no depend on P k . By integrating over 5 before P k , it follows that Ks(A k /k) = 1, and consequently, 
it is enough to show that var,5 ( A^/k) = o(l). By the law of total variance, and using the fact that 
tr(M) = k and HIMIH^ = k for each realization of P k , we have 



vara ( A k /k) = —E M [vai z [Z T MZ\M]] + var M [E Z [Z T MZ\M]] 



- k 2 ' °' 

which is an o(l) quantity as required. 



^E M [var z [Z^MZ\M}]+^ 2 

pE M [2|||Af||||.] + pvaxM[tr(M)] 
2k 



B Proof of Theorem [T] 

Let t := an d Z ~ N(0, I pxp ). Under the null hypothesis that 5 = 0, we have X - Y = /FE 1 / 2 ^, 

and consequently, 

Tl = Z T S^EpJP^PjEP)-^^ E 1 / 2 Z, 

V ' 

= :A 

which gives the convenient representation = Z T AZ. Note that A is a random matrix, and that 
we may take X — Y and E to be independent for Gaussian data [2S|, p. 77]. Consequently, we may 
assume that Z and A are independent. Our overall strategy is to work conditionally on A, and use 
the representation 

(^;N.)=E,p i ( zTB -s,| J 



23 



where i£l, To demonstrate the asymptotic normality of AZ, we will show in Section TB. 41 that 
the limit |||-A||| op / |||^4|||^ = op A (l) holds. This implies the Lyupanov condition [29| . which in turn 
implies the Lindeberg condition, and it then follows [30, Sec. 3.2] that 4 



sup 



z 



*-j*^L< x \A)-*{z) =op a (1). (35) 



The next step is to show that tr(^4) and |||^4|||^ can be replaced with deterministic counterparts 
fin '■= n an d o n '■= \ TTZTr V-i V™- More precisely, we show in Sections IB . 21 and [B . 31 (respectively) 
that 

tr(A) — fl n = Of A (\/n), and (36a) 
F - d n = ov A {\fn). (36b) 

Substituting the previous formulas into the limit (|35p . it follows that 



< x \ A I - $(x) 



Z T AZ - fl n 

and the central limit theorem (J3|) follows from the dominated convergence theorem. □ 
B.l Matrix Reduction of A. 

Before dealing directly with tr(^4) and |||>1||| in Sections IB.2[ IB.3I and IB.41 we will need a lemma 
that reduces the matrix A := E 1//2 Ep f , [P^(P^ r EP)~ 1 . Pj | E 1 / 2 to a simpler form. In particular, let 
us recall the definition of the thin QR factorization [17, p. 230]: namely, if M € MP xk is full rank 
with k < p, then there exist unique matrices Q 6 M. pxk and R £ M. kxk such that M = QR, where 
Q has orthonormal columns, and R is an invertible upper triangular matrix with positive diagonal 
entries. 

Lemma 4. Let € M. pxk be a random matrix that is full rank with probability 1, and let X 1 / 2 ^ = 
QR be a thin QR factorization. Then we have 

A = nE Q [Q(Q T W p Q)- 1 Q T ], (37) 
where W p ~ W p (n, I pX p) is a white Wishart matrix that is independent of Q. 

Since £ = S^TUpE 1 / 2 , this claim ((37]) follows after substituting the formula P k = E~^ 2 QR into 
the definition of A and making use of several cancellations. 

B.2 Asymptotic formula for tr(A) 

In this section, we verify the limit ()36ap . Define the sequence of random variables X n := -^(tr(A) — fi n ). 
It suffices to show that E|X n | -4 0. Using the cyclic property of trace, in conjunction with equa- 
tion (f3"7"j) and Q T Q = Ikxk, we obtain tr(yl) = nE,Qltr((Q T W p Q)~ r )]. Using this relation, we 
have 

E|X„| = E [tr((Q T IU p Q)- 1 )] 



4 The uniformity of the limit (|35|) will be useful later on in Section [3TT 
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By Jensen's inequality, we obtain an upper bound by extracting Eq from the absolute value, and 
then we exchange Ejy and Eq to obtain 



E|X n | < ^Q^w P tr((Q T W p Q)~ 1 ) 



Vn 



1-3/n 



Since Q and W p are independent, we may fix Q without affecting the distribution of W p , and we 
also note that for a fixed Q, the matrix Q T W P Q is distributed as a white Wishart matrix Wk whose 
distribution does not depend on Q. Consequently, if we take the expectation over W p first, then 
the expectation over Q becomes irrelevant, and we have 

Vn 



1 



Uu 



By Lemma 2.2 in the paper @], the variable Y n := y/n(^tr(Wf, 1 ) — 1 ^ j tends to in probability, 

and so it remains to check that the sequence Y n is uniformly integrable to ensure that E\Y n \ = o(l), 
which will complete the proof. By Theorem 2.4.14 in [3ll . p. 257], mean and variance of Y n are both 
0(1), which means that the sequence Y n is bounded in L 2 , and hence uniformly integrable. 



B.3 Asymptotic formula for m^m^ 

In this section, we verify the limit (|36b|) . Define the sequence of variables X n := (|||^4||| F — a n ). 
It suffices to show that E|X n | — > 0. First observe that by the reduced formula (j37|) for A, we have 



E 



E 



Q 



Q(Q T W P Q)- L Q 



Vn 



I V (1 " Vn) 3 

Applying Jensen's inequality twice, and exchanging the order of Eq and Ew yields 



EIXJ < E n E 



QiQ 1 WpQ)-^ 



Vn 



(1 " Vn) 



Since the non-zero eigenvalues of Q(Q T W P Q) 1 Q T are the same as those of {Q T W P Q)~ 1 , these 
matrices have the same Frobenius norm, and so 



E\X n \ < E Q E Wp 



(Q T W P Q)- 1 



Vn 



(1 " Vn? 



Applying the same reasoning as in the previous subsection, we may replace (Q T W P Q)~ 1 with 
Wk ~ Wfc(n, Ikxk) when working under Ejy ■ , and then the expectation over Q becomes irrelevant. 
Putting together the pieces, we find 



EIXJ < E 



n\\\W, 



(1 " Vr, 



By Lemma 2.2 in the paper [5J, the random variable Y n := v^lll^A: IIIf — y (l-a) 3 tends to 
in probability, and so it remains to check that the sequence Y n is uniformly integrable to ensure 
that E|l^| = o(l), which will complete the proof. By Theorem 2.4.14 in [311 ]. it follows that 



\W, 



integrable. 



-llll 2 



0(1), which implies that the sequence Y n is bounded in L 2 , and hence uniformly 
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B.4 Verification of 



op 



IU\ 



F proven in Section fB.31 shows that \\\A\\\ F tends to a positive 



The asymptotic formula for 

constant in probability under P4. Consequently, to verify |||-A||| / |||^4|||^ = op A (l), it is enough to 
show that lll^lllop tends to in probability, and this is implied by the following lemma. 

Lemma 5. The matrix A satisfies -^E\\A\\ op — > 0. 

Proof. By Jensen's inequality and the fact that the matrices Q{Q T W P Q)~ 1 Q T and {Q 1 W p Q)~ l 
have the same non-zero eigenvalues, 



±E\\\A\\\ op = E Wp 



^q[Q(Q t w p q)- 1 q 
< e Wp e q \q{q t w p qy 1 q j 

= E Q E Wp (Q T W P Q)- 1 



op 



op 



(38) 



op 



Adjusting by the factor of 1/y/n, we have 

1 



E 



n 



op 



< 



op 



For any e G (0, c*), define the constant c* := (1 — -y/y) 2 , and the event E 1 := {A m i n (^ Wk) > c* — e}. 
We now split the expectation E^ fc ||| n ^fc" 1 || a ^ on S & an< ^ -^ c > 



4-Ew, 



\nWi 



lop 



yn "fc 



br 1 -is 

I K I II Op 



Ira, WVT 1 III -lgc 

I fe III op a 



On the set E, the random variable |||^W fc 1 ||| op is bounded by the constant l/(c* — e), and so the 
first piece tends to due to the factor of -7=. Next, we upper bound the second piece in several 
steps. First we replace the operator norm with the Frobenius norm, and then use the Cauchy- 
Schwarz inequality in conjunction with the formula Ey/ k IH^aT 1 )!^ = C(l/n) from Theorem 2.4.14 
of 0, P- 256]. 



n 



\nWZ L \\\ l E - 

I K 1 1 1 Op ^ 



<^E Wk 



11 



< 



E 



u,. \\\nW^\\\l-y/¥(E* 



(39) 



< 



E c ). 



Since A m i n (^ Wk) — > c* almost surely [32j, we have F(E C ) — > 0, which proves the claim. 



□ 
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C Proof of Theorem [2] 

Working under the alternative hypothesis, we seek the limiting value of the power function j3(0) = 
W(h T? > t a ). Since 8^0, the test statistic splits into three terms. Define r := n ^ + ™ 2 , and recall 

fl = (1/r) • (X - y) T E P jP fe (P fe T MP)- 1 P fc T ](X - Y), 

where 

X-Y = ^ 1/2 Z + 5, 

and Z is a standard Gaussian p- vector Z. Expanding the definition of Ti 2 , and adjusting by a factor 
of -, we have the decomposition ^ Tj 1 = I + II + III, where 

I := ±Z T AZ (40a) 
II := 2^—Z T A^- 1 / 2 5 (40b) 

III := — 5 T Y 1 - l/2 AT l - 1/2 5. (40c) 
jit 

Here the reader should recall the definition of the matrix A from Appendix [B] — namely 

A := S 1/2 E Pk [P k (P^EP)- 1 P^] £ 1/2 . 
With these definitions in hand, we will work conditionally on A and consider 

f T 2 > t a ) = E A F z (l > ±i Q - II - III | A) . 
Studentizing with tr(^A) and |||^;^4||| F , and multiplying top and bottom by 



n, 



T 2 >f \ _ EA¥; ( ZT { ±A)Z-tr(±A) ^(l ta -tr(^)-II-IIl) I \ , , 

T k > t a ) - E A F Z { ^ > ^ii^ii | A) . (41) 



Recall the definition of the critical value t a := j^-n + J ^ ^Jn Z\- a , and define the numerical 
sequence t n := — I n _ r k \_ 1 J Afc. In Sections IC.ll and IC.2|, we establish the limits 



Pz(V^|H| > e |AJ = op a (1), and ^(III-^) = o Pa (1) 
where e > 0. Substituting the limits (|36ap and (|36bp into equation (|i"Tj) yields 



T k > t a ) = E A F z | > — * r^r= - + op A (l)M). i 4 ^ 

A (l-Hn) 3 



By the limit (|35 1> . it follows that 
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where the error term op A (1) on the right hand side is bounded by 1. Integrating over A and applying 
the dominated convergence theorem, 

^(t 2 > t a ) = *( - + y/ny/^^in) + 0(1). 

Using the assumptions y n = ^ = a + o(-^), and ^- = 6 + o(-^), we conclude 
P(T| >t Q ) = $(-^ + 6(1-6)-^^ • A fcv ^)+o(l). 

□ 



C.l The cross term II tends to 0. 

In this section, we prove that under the conditions Theorem [2J for any e > 0, we have the limit 



It suffices to show that E^P^f \y/nll\ > e \ A 



V^H| >e \A) =op a (1). (43) 

P^|\/nII| > = o(l). Since \/nll has mean 

0, i.e. E^E^ [-^/nll] = E^[0] = 0, Chebyshev's inequality implies that the last condition will hold if 
varf-y/nll] = o(l). By the law of total variance, we have 

var[VnII] = E A var z [ynll | A] + var A E z [y/nil \A]], 

where we note the the second term is 0. Writing out the first term, we have 

var[\/nll] = E^var^ \/nll | A 



— S T E- 1 / 2 A 2 'E~ l / 2 d 



nr 
4 

nr 



< 



nr 



E Q E Wp 



Eq [Q{Q T ^ W P Q)~ X Q T ]TT 1 / 2 8 
Q{Q Tl n W p Q)- 1 Q Y Y l - 1 l 2 5 



where the last step follows by Jensen's inequality. We then use iterated expectation to write 

var[v^H] = —E Q \^TT x f % Q E Wp [(Q T ± W P Q)-^ Q T XT 1/2 <5 
4 



— E c 
nr 



S^^-^Q (c n I kxk )Q T ^ 1 / 2 5 



where the last step uses the fact that Ew/ fe [(4Wfc) l ] = c n Ikxk with c n 
maximum singular value of Q is one, we have 



71— fc— 1 ' 



Since the 



var^II] = ^E Q \ Q T ^ 2 S *] < ^\\^/ 2 5\\ 2 
tit L 2 J nr 

= 0(5 T Z- 1 5), 
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where the final step uses the facts that c n = 0(1) and — = 0(1). This completes the proof, since 
5 T T,~ 1 5 = o(l) by our local alternative assumption. 



C.2 Limiting value of III. 

Before we study the limiting value of III, we record an elementary lemma. 

Lemma 6. Let Wk ~ Wk(n, Ikxk) be a white kx k Wishart matrix, and let Xn-k+i ^ e a chi-square 
variable on n — k + 1 degrees of freedom. If u G M fc is a fixed unit vector, and n > k + 3, then 

1.. * 1 



which has mean 



n—k—l 



and variance 



(n-fc-l) 2 (n-fc-3) ' 



Proof. By Corollary 2.4.13.1 in 3l|, p. 256], the random variable (u'W^u)' 1 is distributed accord- 
ing to W\{n — k + 1, (u J Ikxku)^ 1 ), which is the same as the chi-square distribution on n — k + 1 
degrees of freedom. □ 



In order to give the limiting value of III, we now define some notation. Let v := S 1//2 <5, and 
recall the definition £ n := ^ ( w _fc_i ) Tracing back through the definition of Q in Lemma [U 
it is straightforward to verify that E ( g||(5 T t'||2 = and so we may also write 



■n 



Tn\n — k 

which will turn out to be the asymptotic equivalent of III. We now show that under the conditions 
of Theorem [2] 

v^(lH-0 =o Pa (1). 

Recalling the definition III := ■±-6 T 'E- 1 / 2 AEr 1 / 2 6, we define the sequence of random variables 
V n := y^O-H — ^n)- Since 



E[A) =nE Q [QE Wp [(Q T W p Q)- 1 ]Q T ] = E Q [Q- 



n 



-hxkQ 



n 



-E Q [QQ T ], (44) 



n—k—l J n—k—l 

we see that E[V n ] = 0. Consequently, it suffices to show that var[K„] = E[V^] — > 0. Note that 



y n 2 = n I — v 1 t\ 
\ nT 



Q 



nQ[Q'W p Q) Q 



-1 



v-L 



To obtain an upper bound on the second moment, we apply Jensen's inequaity, and switch the 
order of integration, thereby obtaining 

2i 



EWJ 



nE w 



<n J E n E 



1 



-E 



nr 



Q 



nv T Q (Q T W P Q) 1 Q t v ' 



n 



rn Vn — k — 1 



nr 



nQ [Q WpQ) Q 



n 



rn V n — k — 1 



\\Q T v 



T „||2 
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Since W p and Q are independent, we may assume that Q is fixed when working under Ew p . Define 
the unit vector u = ||qt Jj| 2 > an d recall that for a fixed Q, the matrix Q T W P Q is distributed as ~ 
Wk(n, Ikxk)- Moving \\Q T v ||| between Eg and Ew fe , and moving ^ outside both expectations, we 
have 



E[y n 2 ] < 



(nr) 



2 E 



Since the expectation under E^ fc does not depend on Q, we may factor the right hand side into a 
product of two expectations. After moving the factor of n 3 in front of Ey^ fe , this gives 



u T W k - l u 



n — k — 1 



By the lemma, uJW, 1 u ~ -5-^ — , which has mean - \ 1 and variance 7 — ; — ttIt — ; — 1^. Conse- 

J ' k xi-k+i n-k-l (n-k-iy(n-k-3) 

quently 



n 3 E w , 



(• T ^-^ri)1- w 



We also recall that t^tt 

(nr)- 3 



6 2 (1 — 6) 2 , and that under our local alternative, 



E Q [\\Q T v\\ 4 2 ] < \\v\\i = (5 T i:- 1 5) 2 l 
which completes the proof that E^] — )■ 0. 
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